arXiv:1508.05056v2 [cs.MM] 24 Aug 2015 


Diving Deep into Sentiment: Understanding Fine-tuned
CNNs for Visual Sentiment Prediction


Victor Campos Amaia Salvador Brendan Jou Xavier Giro-i-Nieto 


Universitat Politecnica de Catalunya (UPC), Barcelona, Catalonia/Spain 
Columbia University, New York, NY USA 

victor.campos.camunez@alu-etsetb.upc.edu, {amaia.salvador,xavier.giro}@upc.edu,
bjou@ee.columbia.edu


ABSTRACT 

Visual media are powerful means of expressing emotions and
sentiments. The constant generation of new content in social
networks highlights the need for automated visual sentiment
analysis tools. While Convolutional Neural Networks (CNNs) have
established a new state of the art in several vision problems,
their application to the task of sentiment analysis is mostly
unexplored and there are few studies regarding how to design
CNNs for this purpose. In this work, we study the suitability of
fine-tuning a CNN for visual sentiment prediction and explore
performance-boosting techniques within this deep learning
setting. Finally, we provide a deep-dive analysis into a
benchmark, state-of-the-art network architecture to gain insight
about how to design patterns for CNNs on the task of visual
sentiment prediction.

Categories and Subject Descriptors 

H.1.2 [Models and Principles]: User/Machine Systems;
I.2.10 [Artificial Intelligence]: Vision and Scene Understanding

Keywords 

Sentiment; Convolutional Neural Networks; Social Multimedia;
Fine-tuning Strategies

1. INTRODUCTION 

The recent growth of social networks has led to an explosion in
the amount, throughput and variety of multimedia content
generated every day. One reason for the richness of this social
multimedia content is that it has become one of the principal
ways in which users share their feelings and opinions about
nearly every sphere of their lives. In particular, visual media,
like images and videos, have risen as one of the most
pervasively used and shared types of document in which emotions
and sentiments are expressed.



Figure 1: Overview of the presented system for visual
sentiment prediction.


The advantages of having machines capable of understanding
human feelings are numerous and would imply a revolution in
fields such as robotics, medicine or entertainment. Some
interesting preliminary applications are already beginning to
emerge, e.g. for emotional understanding of viewer responses to
advertisements using facial expressions [15]. However, while
machines are approaching human performance on several
recognition tasks, such as image classification [?], the task of
automatically detecting sentiments and emotions from images and
videos still presents many unsolved challenges. Numerous
approaches towards bridging the affective gap, or the conceptual
and computational divide between low-level features and
high-level affective semantics, have been presented over the
years for visual multimedia [?], but performance has remained
modest and the intuitions behind these results have been
lacking.

Promising results obtained using Convolutional Neural Networks
(CNNs) in many fundamental vision tasks have led us to consider
the efficacy of such machinery for higher abstraction tasks like
sentiment analysis, i.e. classifying the visual sentiment
(either positive or negative) that an image provokes in a human.
Recently, some works [27] explored CNNs for the task of visual
sentiment analysis and obtained encouraging results that
outperform the state of the art, but they develop very little
intuition and analysis into the CNN architectures they used. Our
work focuses on acquiring insight into fine-tuned layer-wise
performance of CNNs in the visual sentiment prediction setting.
We address the task of assessing the contribution of individual
layers in a state-of-the-art fine-tuned CNN architecture for
visual sentiment prediction.

Our contributions include: (1) a visual sentiment prediction
framework that outperforms the state-of-the-art approach on an
image dataset collected from Twitter using a fine-tuned CNN,
(2) a rigorous analysis of layer-wise performance in the task of
visual sentiment prediction by training individual classifiers
on feature maps from each layer in the former CNN, and
(3) network architecture surgery applied to a fine-tuned CNN for
visual sentiment prediction.


2. RELATED WORK 

Several approaches towards overcoming the gap between visual
features and affective semantic concepts can be found in the
literature. In [?], the authors explore the potential of two
low-level descriptors common in object recognition, Color
Histograms (LCH, GCH) and SIFT-based Bag-of-Words, for the task
of visual sentiment prediction. Some other works have considered
the use of descriptors inspired by art and psychology to address
tasks such as visual emotion classification or automatic image
adjustment towards a certain emotional reaction [?]. In [?], a
Visual Sentiment Ontology based on psychology theories and web
mining, consisting of 3,000 Adjective Noun Pairs (ANP), is
built. These ANPs serve as a mid-level representation that
attempts to bridge the affective gap, but they are very
dependent on the data that was used to build the ontology and
are not completely suitable for domain transfer.

The increase in computational power in GPUs and the creation of
large image datasets such as [?] have allowed Deep Convolutional
Neural Networks (CNNs) to show outstanding performance in
computer vision challenges [11, 22, ?]. Despite requiring huge
amounts of training samples to tune their millions of
parameters, CNNs have proved to be very effective in domain
transfer experiments [16]. This interesting property of CNNs is
applied to the task of visual sentiment prediction in [25],
where the winning architecture of ILSVRC 2012 [11]
(5 convolutional and 3 fully connected layers) is used as a
high-level attribute descriptor in order to train a sentiment
classifier based on Logistic Regression. Although the authors do
not explore the possibility of fine-tuning, they show how the
off-the-shelf CNN descriptors outperform hand-crafted low-level
features and SentiBank [?]. Given the distinct nature of visual
sentiment analysis and object recognition, the authors in [27]
explore the possibility of designing a new architecture specific
to the former task, training a network with 2 convolutional and
4 fully connected layers. However, very little rationale is
given for why they configured their network in this way, except
for the last two fully connected layers. Our work focuses on
fine-tuning a CNN for the task of visual sentiment prediction
and later performing a rigorous analysis of its architecture, in
order to shed some light on the problem of CNN architecture
design for visual sentiment analysis.


3. METHODOLOGY 

The Convolutional Neural Network architecture employed in our
experiments is CaffeNet, a slight modification of the ILSVRC
2012 winning architecture, AlexNet [11]. This network, which was
originally designed and trained for the task of object
recognition, is composed of 5 convolutional layers and 3 fully
connected layers. The first two convolutional layers are
followed by pooling and normalization layers, while a pooling
layer is placed between the last convolutional layer and the
first fully connected one. The experiments were performed using
Caffe, a publicly available deep learning framework.

We adapted CaffeNet to a sentiment prediction task using the
Twitter dataset collected and published in [?]. This dataset
contains 1,269 images labeled as positive or negative by 5
different annotators. The choice was based on the fact that
images in the Twitter dataset are labeled by human annotators,
as opposed to other annotation methods which rely on textual
tags or predefined concepts. Therefore, the Twitter dataset is
less noisy and allows the models to learn stronger concepts
related to the sentiment that an image provokes in a human.
Given the subjective nature of sentiment, different subsets can
be formed depending on the number of annotators that agreed on
their decision. Only images on which all the annotators reached
consensus (5-agree subset) were considered in our experiments.
The resulting dataset is formed by 880 images (580 positive, 301
negative), which was later divided into 5 different folds to
evaluate experiments using cross-validation.
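
The 5-fold protocol can be sketched as follows. This is a minimal
illustration only: the text does not state how the folds were
drawn, so the random shuffle, the seed and the function names
`make_folds` and `cross_validation_splits` are assumptions.

```python
import random


def make_folds(n_items, n_folds=5, seed=0):
    """Shuffle item indices and deal them into n_folds near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumed: plain random assignment
    # Fold i takes every n_folds-th shuffled index, starting at position i.
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validation_splits(n_items, n_folds=5, seed=0):
    """Yield one (train_indices, test_indices) pair per held-out fold."""
    folds = make_folds(n_items, n_folds, seed)
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, test
```

Each image is used for testing exactly once, and the reported
accuracy is the mean over the 5 held-out folds.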

Each of the following subsections is self-contained and
describes a different set of experiments. Although the training
conditions for all the experiments were kept as similar as
possible for the sake of comparison, there might be slight
differences given each individual experimental setup. For this
reason, every section contains the experiment description and
its training conditions as well.


3.1 Fine-tuning CaffeNet 

The adopted CaffeNet architecture contains more than 60 million
parameters, a figure too high for training the network from
scratch with the limited amount of data available in the Twitter
dataset. Given the good results achieved by previous works on
transfer learning [16, 20], we decided to explore the
possibility of fine-tuning an already existing model.
Fine-tuning consists of initializing the weights in each layer
except the last one with the values learned by another model.
The last layer is replaced by a new one, usually containing the
same number of units as classes in the dataset, and its weights
are randomly initialized before “resuming” training with inputs
from the target dataset. The advantage of this approach compared
to fully re-training a network from a random initialization of
all the network weights is that it essentially starts the
gradient descent learning from a point much closer to an
optimum, reducing both the number of iterations needed before
convergence and the likelihood of overfitting when the target
dataset is small.
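
The initialization step described above can be sketched with
plain dictionaries standing in for a model. This is not the
Caffe mechanism used in the experiments: the helper name
`init_for_finetuning`, the Gaussian initializer and its standard
deviation of 0.01 are assumptions made for illustration.

```python
import random


def init_for_finetuning(pretrained, last_layer, n_inputs, n_classes, seed=0):
    """Reuse all pretrained layers except `last_layer`.

    `pretrained` maps layer names to weight matrices (lists of rows).
    The last layer is re-created with one row per target class and
    small random weights (assumed initializer: Gaussian, std 0.01).
    """
    rng = random.Random(seed)
    # Shallow copy: reused layers keep pointing at the pretrained weights.
    weights = {name: w for name, w in pretrained.items()
               if name != last_layer}
    weights[last_layer] = [
        [rng.gauss(0.0, 0.01) for _ in range(n_inputs)]
        for _ in range(n_classes)
    ]
    return weights
```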

In our sentiment analysis task, the last layer from the original
architecture, fc8, is replaced by a new one composed of 2
neurons, one for positive and another for negative sentiment.
The CaffeNet model trained on the ILSVRC 2012 dataset is used to
initialize the rest of the parameters in the network for the
fine-tuning experiment. Results are evaluated using 5-fold
cross-validation. The models for all folds are fine-tuned for 65
epochs (that is, every training image is seen 65 times by the
CNN), with an initial base learning rate of 0.001 that is
divided by 10 every 6 epochs. As the weights in the last layer
are the only ones which are randomly initialized, its learning
rate is set to be 10 times higher than the base learning rate in
order to provide faster convergence.
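
The schedule just described can be written as a small function.
This is a sketch under the stated settings; the function name,
the 0-indexed epoch convention and the `lr_mult` parameter name
are assumptions.

```python
def learning_rate(epoch, base_lr=0.001, step_epochs=6, gamma=0.1,
                  lr_mult=1.0):
    """Step schedule: scale the rate by gamma after every step_epochs epochs.

    `epoch` is 0-indexed. `lr_mult` is the per-layer multiplier:
    1 for layers initialized from the pre-trained model, 10 for the
    randomly initialized last layer.
    """
    return base_lr * lr_mult * gamma ** (epoch // step_epochs)
```

For example, `learning_rate(0)` gives the base rate of 0.001,
`learning_rate(6)` is ten times smaller, and
`learning_rate(0, lr_mult=10)` is the rate applied to the new
2-neuron layer at the start of training.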

Figure 2: Experimental setup for the layer analysis using
linear classifiers. The number in brackets next to each fully
connected layer is the number of neurons it contains.


A common practice when working with CNNs is data augmentation,
which consists of generating different versions of an image by
applying simple transformations such as flips and crops. Recent
work has shown that this technique yields a consistent
improvement in accuracy. We explored whether data augmentation
improves the spatial generalization capability of our analysis
by feeding 10 different combinations of flips and crops of the
original image to the network in the test stage. The
classification scores obtained for each combination are fused
with an averaging operation.
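
A minimal sketch of this test-time averaging follows. The text
only specifies 10 combinations of flips and crops; the particular
choice below (four corner crops plus the center crop, each with
and without horizontal mirroring) is a common oversampling scheme
and is an assumption here, as are the function names.

```python
def ten_crop_views(height, width, crop):
    """Return 10 views as (top, left, mirrored) tuples.

    Assumed layout: 4 corner crops and 1 center crop of size
    crop x crop, each taken once as-is and once mirrored.
    """
    boxes = [(top, left)
             for top in (0, height - crop)
             for left in (0, width - crop)]
    boxes.append(((height - crop) // 2, (width - crop) // 2))
    return [(top, left, mirrored)
            for mirrored in (False, True)
            for (top, left) in boxes]


def average_scores(per_view_scores):
    """Fuse per-view class scores by averaging each class over all views."""
    n_views = len(per_view_scores)
    return [sum(class_scores) / n_views
            for class_scores in zip(*per_view_scores)]
```

In use, the network would be run once per view and the resulting
10 score vectors passed to `average_scores`; the predicted class
is the one with the highest averaged score.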

3.2 Layer by layer analysis 

Despite the outstanding performance of CNNs in many vision
tasks, there is still little intuition into how to design them.
In order to gain some insight about the contribution of each
individual layer to the task of visual sentiment prediction, we
performed an exhaustive layer-by-layer analysis of the
fine-tuned network.

The outputs of individual layers have been previously used as
visual descriptors [19, 20], where each neuron’s activation is
seen as a component of the feature vector. Traditionally, top
layers have been selected for this purpose as they are thought
to encode high-level information. We further explore this
possibility by using each layer as a feature extractor and
training individual classifiers for each layer’s features (see
Figure 2). This study allows measuring the difference in
accuracy between layers and gives intuition not only about how
the overall depth of the network might affect its performance,
but also about the role of each type of layer, i.e. CONV, POOL,
NORM and FC, and their suitability for visual sentiment
prediction.

Neural activations in fully connected layers can be represented
as d-dimensional vectors, where d is the number of neurons in
the layer, so no further manipulation is needed. This is not the
case for earlier layers, i.e. CONV, NORM and POOL, whose feature
maps are multidimensional, e.g. feature maps from conv5 are
256x13x13 dimensional. These feature maps were flattened into
d-dimensional vectors before using them for classification
purposes. Two different linear classifiers are considered:
Support Vector Machine with linear kernel and Softmax. The same
5-fold cross-validation procedure followed in the previous
experiment is employed, training independent classifiers for
each layer. Each classifier’s regularization parameter is
optimized by cross-validation.
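
As a worked example of the flattening step, the sketch below
computes the vector length d for a feature map shape and
flattens a nested-list feature map. The channel-major ordering
and the function names are assumptions; any fixed ordering works
for a linear classifier as long as it is applied consistently.

```python
import math


def flattened_dim(shape):
    """Length d of the vector obtained by flattening a feature map."""
    return math.prod(shape)


def flatten(feature_map):
    """Flatten a channels x rows x cols nested list, channel by channel."""
    return [value
            for channel in feature_map
            for row in channel
            for value in row]
```

For conv5, `flattened_dim((256, 13, 13))` gives d = 43,264, so
the classifiers trained on this layer operate on vectors roughly
ten times longer than the 4,096-dimensional fc6 and fc7 outputs.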


3.3 Layer ablation 

More intuition about the individual contribution of each layer
can be gained by modifying the original architecture prior to
training. This task is addressed by fine-tuning altered versions
of the original CaffeNet where top layers had been successively
removed.

Different approaches to the layer removal problem might be
taken, depending on the changes made to the remaining
architecture. In our experiments, two different strategies are
adopted: (1) a raw ablation by keeping the original
configuration and weights for the remaining layers, and
(2) adding a 2-neuron layer as a replacement to the removed one,
on top of the remaining architecture and just before the Softmax
layer. A more detailed definition of the experimental setup for
each configuration is described in the following subsections.

3.3.1 Raw ablation 

In this set of experiments, the Softmax layer is placed on 
top of the remaining architecture, e.g. if fc8 and fc7 are 
removed, the output of fc6 is connected to the input of the 
Softmax layer. For the remaining layers, weights from the 
original model are kept as well. 

The configurations studied in our experiments include versions
of CaffeNet where (1) fc8 has been ablated, and (2) both fc8 and
fc7 have been removed (architectures fc7-4096 and fc6-4096,
respectively, in Figure 3). The models are trained during 65
epochs, with a base learning rate of 0.001 that is divided by 10
every 6 epochs. With this configuration all the weights are
initialized using the pre-trained model, so random
initialization of parameters is not necessary. Given this fact,
there is no need to increase the individual learning rate of any
layer.

3.3.2 2-neuron on top 

As described in Section 3.1, fine-tuning consists of replacing
the last layer in a net by a new one and using the weights in a
pre-trained model as initialization for the rest of the layers.
Inspired by this procedure, we decided to combine the former
methodology with the layer removal experiments: instead of
leaving the whole remaining architecture unmodified after a
layer is removed, its last remaining layer is replaced by a
2-neuron layer with random initialization of the weights.

This set of experiments comprises the fine-tuning of modified
versions of CaffeNet where (1) fc8 has been removed and fc7 has
been replaced by a 2-neuron layer, and (2) fc8 and fc7 have been
ablated and fc6 has been replaced by a 2-neuron layer
(architectures fc7-2 and fc6-2, respectively, in Figure 3). The
models are trained during 65 epochs, dividing the base learning
rate by 10 every 6 epochs and with a learning rate 10 times
higher than the base one for the 2-neuron layer, as its weights
are being randomly initialized. The base learning rate of the
former configuration is 0.001, while the latter’s was set to
0.0001 to avoid divergence.

3.4 Layer addition 

None of the architectures that have been introduced so far takes
into account the information encoded in the last layer (fc8) of
the original CaffeNet model. This layer contains a confidence
value for the image belonging to each one of the 1,000 classes
in ILSVRC 2012. In addition, fully connected layers contain, by
far, most of the parameters in a Deep Convolutional Neural
Network. Therefore, from both of these points of view, a
remarkable amount of information is being lost when discarding
the original fc8 layer in CaffeNet.

Figure 3: Layer ablation architectures. Networks fc7-4096 and
fc6-4096 keep the original configuration after ablating the
layers in the top of the architecture (Section 3.3.1), while in
fc7-2 and fc6-2 the last remaining layer is replaced by a
2-neuron layer (as described in Section 3.3.2). The number in
brackets next to each fully connected layer is the number of
neurons it contains.


Table 1: 5-fold cross-validation results on 5-agree Twitter
dataset.

Model                                    Accuracy
Fine-tuned CNN from You et al. [27]      0.783
Fine-tuned CaffeNet                      0.817 ± 0.038
Fine-tuned CaffeNet with oversampling    0.830 ± 0.034


Similarly to the procedure followed in the layer removal
experiments, two different approaches are considered in order to
take advantage of the information in the original fc8: (1) the
original CaffeNet architecture is fine-tuned, keeping the
original configuration and weights for fc8, and (2) a 2-neuron
layer (fc9) is added on top of the original architecture
(architectures fc8-1000 and fc9-2, respectively, in Figure 4).
Models are trained during 65 epochs, with a base learning rate
of 0.001 that is divided by 10 every 6 epochs. The only layer
that has a higher individual learning rate is the new fc9 in
configuration fc9-2, which is set to be 10 times higher than the
base learning rate, given that its weights are randomly
initialized.

4. EXPERIMENTAL RESULTS 

This section presents the results for the experiments proposed
in the previous section, as well as intuition and conclusions.

4.1 Fine-tuning CaffeNet 

Average accuracy results over the 5 folds for the fine-tuning
experiment are presented in Table 1, which also includes the
results for the best fine-tuned model in [27]. This CNN, with a
2CONV-4FC architecture, was designed specifically for visual
sentiment prediction and trained using almost half a million
sentiment-annotated images from the Flickr dataset [?]. The
network was finally fine-tuned on the Twitter 5-agree dataset
with a resulting accuracy of 0.783 which is, to the best of our
knowledge, the best result on this dataset so far.

Figure 4: Architectures using the information contained in the
original fc8 layer and weights. Configuration fc8-1000 reuses
the whole architecture and weights from CaffeNet, while fc9-2
features an additional 2-neuron layer. The number in brackets
next to each fully connected layer is the number of neurons it
contains.


Table 2: Layer analysis with linear classifiers: 5-fold
cross-validation results on 5-agree Twitter dataset.

Layer    SVM              Softmax
fc8      0.82 ± 0.055     0.821 ± 0.046
fc7      0.814 ± 0.040    0.814 ± 0.044
fc6      0.804 ± 0.031    0.81 ± 0.038
pool5    0.784 ± 0.020    0.786 ± 0.022
conv5    0.776 ± 0.025    0.779 ± 0.034
conv4    0.794 ± 0.026    0.781 ± 0.020
conv3    0.752 ± 0.033    0.748 ± 0.029
norm2    0.735 ± 0.025    0.737 ± 0.021
pool2    0.732 ± 0.019    0.729 ± 0.022
conv2    0.735 ± 0.019    0.738 ± 0.030
norm1    0.706 ± 0.032    0.712 ± 0.031
pool1    0.674 ± 0.045    0.68 ± 0.035
conv1    0.667 ± 0.049    0.67 ± 0.032

Surprisingly, fine-tuning a net that was originally trained for
object recognition yielded higher accuracy in visual sentiment
prediction than a CNN that was specifically trained for that
task. On the one hand, this fact suggests the importance of
high-level representations such as semantics in visual sentiment
prediction, as transferring learning from object recognition to
sentiment analysis actually produces high accuracy rates. On the
other hand, it seems that visual sentiment prediction
architectures also benefit from a larger number of convolutional
layers, as suggested by [?] for the task of object recognition.

Averaging the prediction over modified versions of the input
image results in a consistent improvement in the prediction
accuracy. This behavior, which was already observed by the
authors of [?] when addressing the task of object recognition,
suggests that the former procedure also increases the network’s
generalization capability for visual sentiment analysis, as the
final prediction is far less dependent on the spatial
distribution of the input image.

4.2 Layer by layer analysis 

The results of the layer-by-layer analysis of the fine-tuned
CaffeNet are presented in Table 2, both for the SVM and Softmax
classifiers.

Recent works have studied the suitability of Support Vector
Machines for classification using deep learning descriptors,
while others have also replaced the Softmax loss by an SVM cost
in the network architecture [?]. Given the results of our
layer-wise analysis, it is not possible to claim that either of
the two classifiers provides a consistent gain over the other
for visual sentiment analysis, at least on the Twitter 5-agree
dataset with the proposed network architecture.


Table 3: Layer ablation: 5-fold cross-validation results on
5-agree Twitter dataset.

Architecture    Without oversampling    With oversampling
fc7-4096        0.759 ± 0.023           0.786 ± 0.019
fc6-4096        0.657 ± 0.040           0.657 ± 0.040
fc7-2           0.784 ± 0.024           0.797 ± 0.021
fc6-2           0.651 ± 0.044           0.676 ± 0.029


Accuracy trends at each layer reveal that the depth of the
network contributes to the increase in performance. Not every
single layer produces an increase in accuracy with respect to
the previous one, but even in those stages it is hard to claim
that the architecture should be modified, as higher layers might
be benefiting from its effect; e.g. conv5 and pool5 report lower
accuracy rates than the earlier conv4 when their feature maps
are used for classification, but later fully connected layers
might be benefiting from the effect of conv5 and pool5, as all
of them report higher accuracy than conv4.

An increase in performance is observed with each fully connected
layer, as every stage introduces some gain with respect to the
previous one. This fact suggests that adding further fully
connected layers might yield even higher accuracy rates, but
further research is necessary to evaluate this hypothesis.

4.3 Layer ablation 

The four ablation architectures depicted in Figure 3 are
compared in Table 3. These results indicate that replacing the
last remaining layer by a 2-neuron fully connected layer is a
better solution than reusing the information of existing layers
of a much higher dimensionality. One reason for this behavior
might be the number of parameters in each architecture, as
replacing the last layer by one with just 2 neurons produces a
huge decrease in the parameters to optimize and, given the
reduced number of available training samples, that reduction can
become beneficial.
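
The size of this reduction can be estimated with a short
calculation. The sketch below counts only the fully connected
parameters of each ablation architecture and assumes standard
dense layers with one bias per output neuron; the 9,216-dimensional
pool5 output and the 4,096-neuron fc6 and fc7 layers are taken
from the text, while the helper name `dense_params` is
illustrative.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (assumed dense + bias)."""
    return n_in * n_out + n_out


POOL5_DIM = 9216  # flattened pool5 output
FC_DIM = 4096     # width of fc6 and fc7

fc6 = dense_params(POOL5_DIM, FC_DIM)

# Fully connected parameters kept in each ablation architecture.
fc_params = {
    "fc7-4096": fc6 + dense_params(FC_DIM, FC_DIM),  # fc6 and fc7 kept
    "fc7-2": fc6 + dense_params(FC_DIM, 2),          # fc7 replaced by 2 neurons
    "fc6-4096": fc6,                                 # only fc6 kept
    "fc6-2": dense_params(POOL5_DIM, 2),             # fc6 replaced by 2 neurons
}
```

Under these assumptions, fc7-2 has about 16.8 million fewer
fully connected parameters than fc7-4096, and fc6-2 keeps only
18,434 of the roughly 37.8 million in fc6-4096.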

Accuracy is considerably reduced when ablating fc7 and setting
fc6 to be the last layer, independently of the method that was
used. Further research revealed that models learned for
architecture fc6-4096 always predict the majority class, i.e.
positive sentiment, which can be explained by the reduced amount
of training data. This behavior is not observed in architecture
fc6-2, where the number of parameters is greatly reduced in
comparison to fc6-4096, but its performance is still very poor.
Nevertheless, this result is somewhat expected, as the reduction
from a 9,216-dimensional vector in pool5 to a layer with just 2
neurons might be too abrupt. These observations suggest that a
single fully connected layer might not be useful for the
addressed task.
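
The majority-class behavior can be checked against the class
balance of the dataset. The sketch below uses the per-class
counts quoted in Section 3 (580 positive, 301 negative); the
function name is illustrative, and per-fold values would vary
slightly with the split.

```python
def majority_class_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the largest class."""
    return max(class_counts.values()) / sum(class_counts.values())


baseline = majority_class_accuracy({"positive": 580, "negative": 301})
```

The resulting baseline is roughly 0.66, in line with the accuracy
reported for fc6-4096 in Table 3.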

Finally, it is important to notice that networks which are
fine-tuned after ablating fc8, i.e. architectures fc7-4096 and
fc7-2, provide accuracy rates which are very close to the
fine-tuned CNN in [27] or even higher. These results, as shown
by the authors in [28] for the task of object recognition,
suggest that removing one of the fully connected layers (and
with it, a high percentage of the parameters in the
architecture) only produces a slight deterioration in
performance, but the huge decrease in the parameters to optimize
might allow the use of smaller datasets without overfitting the
model. This is a very interesting result for visual sentiment
prediction given the difficulty of obtaining reliable annotated
images for such a task.


Table 4: Layer addition: 5-fold cross-validation results on
5-agree Twitter dataset.

Architecture    Without oversampling    With oversampling
fc8-1000        0.723 ± 0.041           0.731 ± 0.036
fc9-2           0.795 ± 0.023           0.803 ± 0.034

4.4 Layer addition 

The architectures that keep fc8 are evaluated in Table 4,
indicating that architecture fc9-2 outperforms fc8-1000. This
observation, together with the previous one in Section 4.3,
strengthens the thesis that CNNs deliver higher performance in
classification tasks when the last layer contains one neuron for
each class.

The best accuracy results when reusing information from the
original fc8 are obtained by adding a new layer, fc9, although
they are slightly worse than those obtained with the regular
fine-tuning (Table 1). At first sight, this observation may seem
contrary to the intuition gained in the layer-wise analysis,
which suggested that a deeper architecture would have a better
performance. If a holistic view is taken and not only the
network architecture is considered, we observe that including
information from the 1,000 classes in ILSVRC 2012 (e.g. zebra,
library, red wine) may not help in sentiment prediction, as they
are mainly neutral or do not provide any sentimental cues
without contextual information.

The reduction in performance when introducing semantic concepts
that are neutral with respect to sentiment, together with the
results in Section [?], highlights the importance of an
appropriate mid-level representation, such as the Visual
Sentiment Ontology built in [?], when addressing the task of
visual sentiment prediction. Nevertheless, they suggest that
generic features such as neural codes in fc7 outperform semantic
representations when the latter are not sentiment specific. This
intuition agrees with the results in [?], where the authors
found that training a classifier using CaffeNet’s fc7 instead of
fc8 yielded better performance for the task of visual sentiment
prediction.

5. CONCLUSIONS 

We presented several experiments studying the suitability of
fine-tuned CNNs for the task of visual sentiment prediction. We
showed the utility of deep architectures that are capable of
capturing high-level features when addressing the task,
obtaining models that outperform the best results so far on the
evaluation dataset. Data augmentation has been shown to be a
useful technique for increasing visual sentiment prediction
accuracy as well. Our study of domain transfer from object
recognition to sentiment analysis has reinforced common good
practices in the field: discarding the last fully connected
layer adapted to another task, and adding a new randomly
initialized layer with as many neurons as there are categories
to classify.